Skip to content

feat(prow-job-executor): abort gating E2E job on rollout cancellation (AROSLSRE-1339)#259

Open
raelga wants to merge 3 commits into
Azure:mainfrom
raelga:raelga/aroslsre-1339-abort-gating-e2e
Open

feat(prow-job-executor): abort gating E2E job on rollout cancellation (AROSLSRE-1339)#259
raelga wants to merge 3 commits into
Azure:mainfrom
raelga:raelga/aroslsre-1339-abort-gating-e2e

Conversation

@raelga

@raelga raelga commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Jira: AROSLSRE-1339 · Sub-task: AROSLSRE-1349

Problem

When an EV2 gating rollout step is cancelled, prow-job-executor receives SIGTERM but previously just exited, leaving the postsubmit Prow E2E job running to completion. That wastes CI capacity and lets a job for an abandoned rollout keep reporting status.

Goal

Propagate the rollout cancellation to the gating Prow job: when the executor is cancelled, abort the E2E job it submitted, without ever touching a sibling region's job.

What changes

  • Abort the running job via Gangway BulkJobStatusChange (POST /v1/bulk-job-status-update) — there is no per-execution abort endpoint.
  • Distinguish external cancellation (SIGTERM) from an internal monitor timeout in WaitForCompletion; only the former triggers an abort, sent with context.WithoutCancel + a 30s budget so it survives the shutdown grace period.
  • Region-aware uniqueness guard: one EV2 pipeline fans out to several regional E2E jobs that share the same job_name and refs, and metav1.Time only serializes to one-second precision. Before aborting, list concurrent same-job/same-state executions and skip the abort (fail-safe) if any sibling shares the target's start-second.
  • New --abort-on-cancel flag (default true) to gate the behaviour.

Example

EV2 gating step cancelled
  -> executor receives SIGTERM
  -> WaitForCompletion sees parent ctx cancelled (not timeout)
  -> handleCancellation: list same-job RUNNING executions
       - sibling shares start-second? -> skip (fail-safe, no abort)
       - otherwise -> BulkJobStatusChange [StartTime, StartTime] -> ABORTED

Validation

From tools/prow-job-executor:

go build ./...
go vet ./...
go test ./...        # all packages ok, incl. new prowjob/abort_test.go
go tool golangci-lint run ./...   # 0 issues

New tests cover abort of a running job, skip-on-region-collision, abort when regions differ in time, the no-op terminal paths, monitor wiring, and the bulk-URL/terminal-state helpers.

Follow-ups

  • After this merges, bump the github.com/Azure/ARO-Tools/tools/prow-job-executor pseudo-version in ARO-HCP test/go.mod (consumed without a replace). Feature is on by default, so no ARO-HCP code change is needed.
  • Cross-env collisions (different job_name, same refs) are out of scope and documented as a residual; the guard only protects same-job_name regional siblings.

… (AROSLSRE-1339)

When the EV2 gating step is cancelled, the executor receives SIGTERM but
previously left the postsubmit Prow E2E job running to completion. Propagate
the cancellation by aborting the job via Gangway BulkJobStatusChange.

Because one EV2 pipeline fans out to several regional E2E jobs that share the
same job_name and refs, the bulk selector is region-blind and metav1.Time only
serializes to one-second precision. A region-aware uniqueness guard lists
concurrent same-job executions and skips the abort (fail-safe) if any sibling
region shares the target's start-second, so a sibling region is never aborted.

Gated behind --abort-on-cancel (default true).
Copilot AI review requested due to automatic review settings June 26, 2026 16:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates prow-job-executor so that when the executor is externally cancelled (e.g., EV2 rollout cancellation leading to SIGTERM), it makes a best-effort attempt to abort the corresponding gating Prow E2E job via Gangway’s bulk status-change API, while adding a safety guard to avoid accidentally aborting a sibling regional execution.

Changes:

  • Add cancellation-aware monitoring logic that distinguishes external cancellation from internal monitor timeout and triggers best-effort aborts on external cancellation.
  • Implement Gangway bulk abort support (BulkJobStatusChange) with an isolation guard based on concurrent executions and start-time second collisions.
  • Add --abort-on-cancel flag (default true) and introduce focused unit tests covering abort/skip paths and monitor wiring.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tools/prow-job-executor/prowjob/monitor.go Distinguishes parent cancellation vs internal timeout; triggers best-effort abort on cancellation.
tools/prow-job-executor/prowjob/client.go Adds bulk-abort support, derives bulk endpoint URL, lists executions, and implements isolation guard logic.
tools/prow-job-executor/prowjob/abort_test.go Adds unit tests for abort behavior, collision skip logic, and monitor cancellation behavior.
tools/prow-job-executor/options.go Adds --abort-on-cancel flag and wires it into executor monitor creation.
tools/prow-job-executor/go.mod Promotes protobuf dependency to direct requirement for protojson usage.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tools/prow-job-executor/prowjob/monitor.go
Comment thread tools/prow-job-executor/prowjob/client.go
…ding

Address review feedback on the cancellation-abort path:

- handleCancellation now re-attaches the in-hand logger to the abort context
  via logr.NewContext. The client methods extract the logger with
  logr.FromContext and fail if it is absent, so the abort must not depend on
  the parent context carrying one. Added a regression test that aborts from a
  logger-less context.
- ListExecutions builds its request URL through net/url instead of string
  concatenation, so a --gangway-url that already carries a query string or
  trailing '?' no longer corrupts the request.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread tools/prow-job-executor/prowjob/client.go
Comment thread tools/prow-job-executor/prowjob/client.go Outdated
…fs to abort

Second-round review feedback:

- deriveBulkURL now preserves any base-path prefix on --gangway-url, swapping
  the /v1/executions route for /v1/bulk-job-status-update instead of replacing
  the whole path. A prefixed URL (host/gangway/v1/executions) no longer drops
  the prefix and 404s the abort.
- AbortJob refuses to abort when the job has no Spec.Refs. Refs are part of the
  bulk selector and the API cannot filter by job name, so a nil-refs abort could
  over-select. Skip instead.
- Added tests for the prefixed-URL derivation and the no-refs no-op.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +107 to +109
u.Path = strings.TrimSuffix(u.Path, executionsPath) + bulkStatusChangePath
u.RawQuery = ""
return u.String()
Comment on lines +294 to +315
isolated, err := c.abortWindowIsIsolated(ctx, logger, job, prowExecutionID)
if err != nil {
// If we cannot prove isolation we err on the side of caution and leave the
// job running rather than risk cancelling another region's execution.
logger.Error(err, "Could not verify the abort would be isolated to this execution; skipping abort")
return nil
}
if !isolated {
return nil
}

if job.Spec.Refs == nil {
// refs (org/repo) are part of the bulk selector and the API cannot filter
// by job name; aborting with nil refs could match a much broader set of
// jobs sharing only type/state. Refuse rather than risk over-selecting.
logger.Info("Job has no refs; skipping abort to avoid selecting unrelated jobs")
return nil
}
refs, err := prowgangway.FromCrdRefs(job.Spec.Refs)
if err != nil {
return fmt.Errorf("failed to convert refs for job %s: %w", prowExecutionID, err)
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants